Abstract

The group membership prediction (GMP) problem involves predicting whether or not a collection of instances share a certain semantic property. For instance, in kinship verification, given a collection of images, the goal is to predict whether or not they share a familial relationship. In this context we propose a novel probability model and introduce latent view-specific and view-shared random variables to jointly account for the view-specific appearance and cross-view similarities among data instances. Our model posits that data from each view is independent conditioned on the shared variables. This postulate leads to a parametric probability model that decomposes the group membership likelihood into a tensor product of data-independent parameters and data-dependent factors. We propose learning the data-independent parameters in a discriminative way with bilinear classifiers, and test our prediction algorithm on challenging visual recognition tasks such as multi-camera person re-identification and kinship verification. On most benchmark datasets, our method significantly outperforms the current state of the art.

1. Introduction 

Visual similarity plays an important role in visual recognition tasks such as object detection and scene understanding [?]. A visual similarity function returns a score of how likely two instances (e.g. images and videos) share similar semantic concepts (e.g. persons, cars, etc.). With this perspective we propose the Group Membership Prediction (GMP) problem, where the goal is to determine how likely a collection of distinct items share the same semantic property. Fig. 1 depicts the idea of the GMP problem for two visual recognition tasks, i.e. person re-identification and kinship verification. In person re-identification (Re-ID) we are given a collection of images of persons captured from multiple views (cameras) and the goal is to detect whether or not they belong to the same person. In applications such as kinship detection, the underlying semantic property is more general, and the goal is to predict whether or not a collection of images share a familial relationship. GMP poses significant challenges on account of large variations in data, including lighting conditions, poses and camera views.

Figure 1. Illustration of group membership prediction (GMP) in the visual recognition tasks of (a) person re-identification and (b) kinship verification. Here we would like to predict (a) whether the four pedestrian images are taken from the same person, and (b) whether the face images are from the same family. These images are borrowed from (a) the VIPeR dataset [13] and (b) the Family 101 dataset [?], respectively.

We introduce a novel parametric probability model for predicting group membership. Our key insight is that although the visual appearances can vary significantly, they share a set of latent variables common to all views. As depicted in Fig. 2, we can hypothesize "body parts" as shared latent variables for all the pedestrian images, while for kinship verification "facial landmarks" could be considered as the shared latent variables. Our model postulates that conditioned on the location of each shared latent variable (body part or facial landmark) the visual appearance at that location is conditionally independent for different views. This property leads to a natural way of measuring image similarities through comparison of visual similarities of the same shared latent variables across different views.

This postulate leads us to a joint parametric probability model that consists of view-specific and view-shared random variables. View-specific variables account for visual characteristics within a view while view-shared variables account for the integration of multi-view information. The group membership likelihood factorizes into a tensor product consisting of data-independent and data-dependent factors. We learn the data-independent parameters (i.e. weights) discriminatively using bilinear classifiers. Finally we marginalize these data tensors over all the dimensions with the learned weights as the group membership scores. Our experimental results on multi-camera person Re-ID and kinship verification demonstrate the good prediction performance and computational efficiency of our method.

Figure 2. Illustration of (a) body parts (e.g. head, torso, legs) for Re-ID and (b) facial landmarks (e.g. eyes, nose, mouth) for kinship verification. Note that in these aligned images, these body parts or facial landmarks approximately coincide in terms of spatial locations.

1.1. Related Work 

The GMP problem is closely related to multi-view learning (MVL). Indeed, our perspective of shared variables has been used before in the context of MVL [?, 30]. Nevertheless, the goal of MVL, specifically in visual recognition, is different from ours. Namely, the objective of MVL is to leverage multiple sources (e.g. texts, images, videos, etc.) of data corresponding to the same underlying object (e.g. persons, events, etc.) to improve recognition performance [?, 30]. On the other hand, our goal is to predict group membership among the multiple sources.

Person Re-ID is essentially a GMP problem, where each camera view can be taken as one of the instances. In the literature, however, most existing works consider this problem as an independent two-view classification task, mainly focusing on cleverly designing local features [?, 36] or learning better metrics [?, 16]. Recently, Figueira et al. [12] proposed a semi-supervised learning method to fuse multi-view features for Re-ID so that the features agree on the classification results. Das et al. [5] considered group membership prediction in Re-ID by maximizing the summation of pairwise similarity scores using binary integer programming during testing. Unlike [5], we formulate the group membership problem as a learning problem, rather than a post-processing step to improve the matching rate.

Kinship verification is another GMP problem, where each family role (e.g. father, mother, son, daughter, etc.) can be considered as an instance. Similar to person Re-ID, existing works mainly focus on learning better features [?] and better distance metrics [?] for pairwise classification [?]. Recently, Qin et al. [28] proposed a bilinear model to handle so-called tri-subject kinship verification problems. Fang et al. [?] proposed a sparse group lasso based feature selection method to determine whether a query person is from a specific family. Unlike [?, 28], our method targets a more general and challenging problem: it can be applied to an arbitrary number of images with a fixed structure of family roles, such as father-son, father-mother-daughter, grandfather-father-son-grandson, etc.

2. Our Method 

2.1. Problem Setting 

Let $\{(X_m, y_m)\}_{m=1,\cdots,M}$ be a group of $M$ persons from different views, where, $\forall m$, $X_m$ denotes the person and $y_m$ denotes its label (e.g. identity or family). Let, $\forall n \in \{1, \cdots, N_m\}$, $x_{m,n} \in X_m$ be the $n$-th image for the person with $N_m$ images in total. The goal of our method is to predict the following probability as group membership:

$$p(y_1 = \cdots = y_M \mid X_1, \cdots, X_M). \tag{1}$$

Note that our problem setting is naturally applicable to the multiple instance cases. For example, during learning we allow multiple images to be associated with a person (i.e. $X_m = \{x_{m,n}\}_{n=1,\cdots,N_m}$) in person Re-ID and kinship verification, as in the CUHK Campus [35] and Family 101 [?] datasets.

While we have motivated our approach in the context of shared latent variables (body parts or facial landmarks), this information is unavailable during the training or testing phases. Furthermore, estimating locations of body parts and facial landmarks is known to be extremely challenging [2, 38]. Fortunately, in the context of the applications and problems that we are concerned with, the images are approximately aligned. In these images, foreground objects are centered and well cropped. Currently most benchmark datasets are composed of such approximately aligned images, namely, the same body parts or facial landmarks appear at roughly similar locations. In such cases, pixel locations provide a good approximation of where body parts and facial landmarks are, and we utilize this property to bypass the detection challenge, while accounting for spatial misalignments with spatial kernels. Note that the issue of visual ambiguity of the shared variables still remains in our problem.

2.2. Parametric GMP Model 

Figure 3. (a) Graphical representation of our parametric probability model for GMP. (b) Pairwise decomposition of our model in (a).

We introduce two latent variables to model the relationship between the class labels $\{y_m\}$ and data samples $\{X_m\}$. The graphical representation of our parametric probability model is shown in Fig. 3(a), where, $\forall m$, $z_m$ denotes the view-specific latent variable for view $m$, $h$ denotes the view-shared latent variable, and $N_m$ denotes the number of images from view $m$. Based on this model, we can factorize our group membership score as follows:

Model interpretation. To show the intuition of our parametric probability model, we consider the person Re-ID example in Fig. 2(a) in more detail. In the Re-ID problem the view-specific latent variables $\{z_m\}$ can be thought of as visual appearances of body parts of different persons, and the view-shared latent variable $h$ can be considered as these body parts, which are shared among all the persons.

Then using Bayes rule we can expand Eq. 2. In particular, for the two-view Re-ID problem the group membership score of the image pair $(x_1, y_1)$ and $(x_2, y_2)$ is $p(y_1 = y_2 \mid x_1, x_2) \propto \sum_{z_1, z_2, h} p(y_1 = y_2 \mid z_1, z_2)\, p(x_1 \mid z_1, h)\, p(x_2 \mid z_2, h)\, p(h)$. Since visual appearances in $z_1$ (or $z_2$) are posited to be independent given image $x_1$ (or $x_2$) and the parts $h$, we can predict whether or not $y_1$ is equal to $y_2$ (i.e. $p(y_1 = y_2 \mid x_1, x_2)$) by marginalizing the similarities of corresponding visual features of each individual part in both images (i.e. $p(x_1 \mid z_1, h)$ and $p(x_2 \mid z_2, h)$) with some data-independent weights (i.e. $p(y_1 = y_2 \mid z_1, z_2)$ and $p(h)$). Similarly, for the kinship example in Fig. 2(b) we can infer the group membership score by marginalizing the corresponding landmark similarities.

We take these data-independent weights as the model parameters for prediction, which are learned discriminatively.
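As a sketch only, the two-view expression above suggests the following general $M$-view form, which is also consistent with the tensor entries and the weight vectors $w_z$ and $w_h$ used in Section 2.3. The proportionality sign and the placement of the sum over images are assumptions made for illustration, not a restatement of the original equation.

```latex
p(y_1 = \cdots = y_M \mid X_1, \cdots, X_M)
  \;\propto\; \sum_{z_1, \cdots, z_M} \sum_{h}
  \underbrace{p(y_1 = \cdots = y_M \mid z_1, \cdots, z_M)\, p(h)}_{\text{data-independent weights}}
  \;\underbrace{\prod_{m=1}^{M} \Big[ \sum_{n=1}^{N_m} p(z_m \mid x_{m,n}, h) \Big]}_{\text{data-dependent factor}}
```

Read this way, the weights depend only on the latent indices, while all image evidence enters through the product term.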


2.3. Discriminative Learning of Model Parameters 

2.3.1 Co-occurrence Tensor Representation 

As discussed in Section 2.1, images are approximately aligned in the related applications. Specifically, in person Re-ID benchmarks the head is always located at the top of images, the torso in the middle, and the legs at the bottom. This typical structure has been exploited in designing discriminative features [?]. Therefore, with approximately aligned images we can bypass the problem of shared variable detection and directly utilize pixel locations as surrogates for locations of body parts or facial landmarks. Note that we can still allow small spatial misalignments by designing kernels to account for spatial distortions.

Recently, Zhang et al. [32] proposed an interesting feature representation to handle visual ambiguity and spatial distortion in images for person Re-ID. The basic idea in their method is to capture visual ambiguity using visual words, and match them at similar locations using a distance transform to handle spatial distortion. This results in a visual word co-occurrence matrix for a pair of images.

Inspired by [32], we propose a visual word co-occurrence representation using $p(z_m \mid x_{m,n}, h)$ from multiple views to represent the group of data samples. The Gaussian kernel proposed in [32] is computationally cumbersome. Instead we design a truncated exponential function as the spatial kernel, with an arbitrary distance function inside, to improve flexibility and computational efficiency.

Let $\pi_u \in \Pi(z_m, x_{m,n})$ be a pixel location where the corresponding pixel in image $x_{m,n}$ is encoded using visual word $z_m$, and $\pi_h$ be the pixel location with index $h$. Then we define $p(z_m \mid x_{m,n}, h)$ in Eq. 2 as follows:

where $d(\cdot, \cdot)$ denotes a distance function, $\sigma_m > 0$ denotes a predefined window size parameter for view $m$, and $\alpha > 0$ is a predefined spatial scale parameter. Then if we take view-specific and view-shared latent variables as the dimensions in the tensor to represent the group of data, the entry at index $(z_1, \cdots, z_M, h)$ can be calculated as $\prod_{m=1}^{M} \left[ \sum_{n=1}^{N_m} p(z_m \mid x_{m,n}, h) \right]$.
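The following is a minimal sketch of how a truncated exponential spatial kernel and one co-occurrence tensor entry could be computed. The exact kernel form is an assumption: here it is taken as `exp(-alpha * d)` inside a window of size `sigma` and zero outside, with the strongest response over all pixels carrying the given visual word. The names `word_map`, `spatial_kernel`, `word_response` and `tensor_entry` are hypothetical, and an image is represented as a dict from pixel coordinates to visual word indices.

```python
import math

def chessboard(u, v):
    """Chessboard (Chebyshev) distance between two pixel coordinates."""
    return max(abs(u[0] - v[0]), abs(u[1] - v[1]))

def spatial_kernel(u, pi_h, alpha, sigma, dist=chessboard):
    """Assumed truncated exponential: exp(-alpha * d) inside the window, 0 outside."""
    d = dist(u, pi_h)
    return math.exp(-alpha * d) if d <= sigma else 0.0

def word_response(word_map, z, pi_h, alpha, sigma):
    """Unnormalized stand-in for p(z | x, h): the strongest kernel response
    over all pixels of the image that are encoded with visual word z."""
    locs = [u for u, w in word_map.items() if w == z]
    return max((spatial_kernel(u, pi_h, alpha, sigma) for u in locs), default=0.0)

def tensor_entry(views, words, pi_h, alpha, sigma):
    """Entry at index (z_1, ..., z_M, h): product over views of the
    responses summed over the images of each view."""
    entry = 1.0
    for images, z in zip(views, words):
        entry *= sum(word_response(img, z, pi_h, alpha, sigma) for img in images)
    return entry
```

With two single-image views, `tensor_entry(views, (0, 0), (0, 0), alpha=1.0, sigma=1)` multiplies the response of word 0 near location (0, 0) in each view; a word that only occurs outside the window contributes zero and removes the whole entry.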


2.3.2 General Learning Formulation 

Here we introduce additional notations to simplify our exposition. Rather than directly representing a group of data samples $\{X_1, \cdots, X_M\}$ as a tensor, we convert it into a matrix $\phi(X_1, \cdots, X_M) \in \mathbb{R}^{\prod_{m=1}^{M} |z_m| \times |h|}$ with dimensions $\prod_{m=1}^{M} |z_m|$ and $|h|$, respectively, where, $\forall m$, $|z_m|$ and $|h|$ denote the numbers of visual words for view $m$ and pixel locations in images. Further, we denote $w_z = p(y_1 = \cdots = y_M \mid z_1, \cdots, z_M) \in \mathbb{R}^{\prod_{m=1}^{M} |z_m|}$ and $w_h = p(h) \in \mathbb{R}^{|h|}$ as our model parameters in the form of vectors. Then our group membership score in Eq. 2 can be rewritten as a decision function $f$ as follows:

$$f(X_1, \cdots, X_M) = w_z^{T} \phi(X_1, \cdots, X_M)\, w_h, \tag{4}$$

where $(\cdot)^T$ denotes the matrix transpose operator. If $f(X_1, \cdots, X_M) > 0$, we expect that all the members in the group have the same class label (and do not otherwise).
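The decision function in Eq. 4 is a plain bilinear form, which the short sketch below evaluates on nested lists. The function names and the toy numbers in the usage note are illustrative assumptions, not values from the experiments.

```python
def bilinear_score(w_z, phi, w_h):
    """f = w_z^T * phi * w_h, with phi given as a list of rows
    (rows index joint visual words, columns index pixel locations)."""
    return sum(
        w_z[i] * sum(phi[i][j] * w_h[j] for j in range(len(w_h)))
        for i in range(len(w_z))
    )

def same_group(w_z, phi, w_h):
    """Predict that all members share the label when the score is positive."""
    return bilinear_score(w_z, phi, w_h) > 0
```

For example, with `phi = [[1, 2], [3, 4]]`, `w_h = [0.5, 0.5]` and `w_z = [1, -1]`, the two rows contribute 1.5 and -3.5, so the score is negative and the group is rejected.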

Let $\{(X_1^{(k)}, \cdots, X_M^{(k)}, y^{(k)})\}_{k=1,\cdots,N}$ be a set of $N$ training data groups from $M$ views, where $y^{(k)} = 1$ if all the class labels in group $k$ are the same (and $-1$ otherwise). Due to the specific form in Eq. 4, we propose learning bilinear classifiers (i.e. $w_z$ and $w_h$) for GMP, inspired by [27], which used bilinear classifiers in a different context (binary classification):

where $\ell(\cdot, \cdot)$ denotes the loss function (e.g. hinge loss), $\lambda_1 > 0$, $\lambda_2 > 0$ are predefined regularization parameters, and $\|\cdot\|_2$ denotes the $\ell_2$-norm of a vector.

Note that here we relax the probability constraint on $w_z$ and $w_h$ to real numbers so that Eq. 5 can be efficiently solved using alternating optimization. In each iteration, we fix one parameter (i.e. $w_z$ or $w_h$) and use a standard support vector machine (SVM) solver to find the other parameter so that the objective value decreases monotonically, thus guaranteeing convergence to a locally optimal solution.
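The alternating scheme can be sketched as follows. This is not the SVM solver used in the experiments: as a stated substitution, each half-step runs a few subgradient steps on an assumed L2-regularized hinge objective and keeps a step only when the objective decreases, which mimics the monotone decrease described above. All function names and default values are hypothetical.

```python
def score(w_z, phi, w_h):
    return sum(w_z[i] * phi[i][j] * w_h[j]
               for i in range(len(w_z)) for j in range(len(w_h)))

def objective(w_z, w_h, data, lam1, lam2):
    """Assumed objective: L2 regularizers plus hinge loss over all groups."""
    reg = 0.5 * lam1 * sum(v * v for v in w_z) + 0.5 * lam2 * sum(v * v for v in w_h)
    return reg + sum(max(0.0, 1.0 - y * score(w_z, phi, w_h)) for phi, y in data)

def subgradient(w, feats, lam):
    """Subgradient of lam/2 * ||w||^2 + sum of hinge losses for linear features."""
    g = [lam * v for v in w]
    for x, y in feats:
        if y * sum(a * b for a, b in zip(w, x)) < 1.0:
            g = [gi - y * xi for gi, xi in zip(g, x)]
    return g

def descend(w, feats, lam, obj, steps=50, lr=0.1):
    """Subgradient steps; a step is kept only if obj decreases, else lr is halved."""
    best = obj(w)
    for _ in range(steps):
        g = subgradient(w, feats, lam)
        cand = [wi - lr * gi for wi, gi in zip(w, g)]
        val = obj(cand)
        if val < best:
            w, best = cand, val
        else:
            lr *= 0.5
    return w

def fit_bilinear(data, n_z, n_h, lam1=0.1, lam2=0.1, rounds=10):
    """data: list of (phi, y) with phi an n_z x n_h nested list and y in {+1, -1}."""
    w_z, w_h = [1.0] * n_z, [1.0] * n_h
    for _ in range(rounds):
        # Fix w_h: each group becomes the linear feature phi * w_h for w_z.
        fz = [([sum(phi[i][j] * w_h[j] for j in range(n_h)) for i in range(n_z)], y)
              for phi, y in data]
        w_z = descend(w_z, fz, lam1, lambda w: objective(w, w_h, data, lam1, lam2))
        # Fix w_z: each group becomes the linear feature phi^T * w_z for w_h.
        fh = [([sum(w_z[i] * phi[i][j] for i in range(n_z)) for j in range(n_h)], y)
              for phi, y in data]
        w_h = descend(w_h, fh, lam2, lambda w: objective(w_z, w, data, lam1, lam2))
    return w_z, w_h
```

The key design point is that with one vector fixed the model is linear in the other, so any linear hinge-loss solver can be dropped in for `descend`.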


2.3.3 Pairwise Decomposition Approximation

With sufficient training data, we can train a bilinear classifier directly using Eq. 5. This training method, however, does not scale well with the number of views due to the high dimensional tensor representation, leading to serious computational and overfitting issues.

To overcome these issues, we propose an approximate pairwise decomposition method, as illustrated in Fig. 3(b), to reduce the parameter space. This is based on the conditional independence assumption in multi-view learning [?]. Accordingly, we can rewrite our group membership score in Eq. 2 as follows:

$$p(y_1 = \cdots = y_M \mid X_1, \cdots, X_M) = \sum_{m_i \neq m_j} \sum_{z_{m_i}, z_{m_j}} \sum_{h} p(y_1 = \cdots = y_M \mid y_{m_i} = y_{m_j})\, p(y_{m_i} = y_{m_j} \mid z_{m_i}, z_{m_j})\, p(z_{m_i} \mid X_{m_i}, h)\, p(z_{m_j} \mid X_{m_j}, h)\, p(h), \tag{6}$$

where $p(y_1 = \cdots = y_M \mid y_{m_i} = y_{m_j})$ indicates how importantly the pair of views $m_i$ and $m_j$ contributes to GMP. In this way, the number of parameters that need to be learned in our method is significantly reduced from $\left(\prod_{m=1}^{M} |z_m| + |h|\right)$ to $\left(\sum_{m_i \neq m_j} |z_{m_i}||z_{m_j}| + |h|\right)$.

Let $\phi(X_{m_i,m_j}) = p(z_{m_i} \mid X_{m_i}, h)\, p(z_{m_j} \mid X_{m_j}, h) \in \mathbb{R}^{(|z_{m_i}||z_{m_j}|) \times |h|}$ be the pairwise visual word matrix between views $m_i$ and $m_j$, where $X_{m_i,m_j} = \{X_{m_i}, X_{m_j}\}$. Also let $w_{m_i,m_j} = p(y_{m_i} = y_{m_j} \mid z_{m_i}, z_{m_j}) \in \mathbb{R}^{|z_{m_i}||z_{m_j}|}$, and let $\beta$ collect the values $p(y_1 = \cdots = y_M \mid y_{m_i} = y_{m_j})$ over all view pairs. Then based on Eq. 6 we can rewrite Eq. 4 as follows:

$$f(X_1, \cdots, X_M) = \sum_{m_i \neq m_j} \beta_{m_i,m_j}\, w_{m_i,m_j}^{T}\, \phi(X_{m_i,m_j})\, w_h, \tag{7}$$

where $\beta_{m_i,m_j}$ denotes the entry in $\beta$ for the view pair.

To learn our model parameters in Eq. 7, we propose two learning methods, namely, multi-view training and double-view training.

Double-view training:

where, $\forall k$, $y^{(k)}_{m_i,m_j} = 1$ if in group $k$ the labels of the two persons satisfy $y_{m_i} = y_{m_j}$, and $0$ otherwise. Here, $\succeq$ denotes an element-wise $\geq$ operator. Both training methods can be carried out using alternating optimization with a standard SVM solver, and convergence to a local optimum is still guaranteed. For two-view scenarios, both training methods are essentially identical. In general, both scale quadratically with the number of views. Linear scalability is also possible if we organize all the views as cycle graphs. The difference between the two training methods comes from the loss functions: in multi-view training $\ell$ measures the group (i.e. multi-view) loss, while in double-view training $\ell$ measures the pair-view loss. Our algorithm is summarized in Alg. 1.

Algorithm 1 Pairwise decomposition based learning
Input: training data $\{(X_m^{(k)}, y_m^{(k)})\}_{\forall m, \forall k}$, $\lambda_1, \lambda_2, \lambda_3 > 0$
Output: $\{w_{m_i,m_j}\}$, $w_h$, $\beta$
Initialize $\beta \leftarrow 1$, $w_h \leftarrow 1$, $w_{m_i,m_j} \leftarrow 1$;
repeat
    Solve $\{w_{m_i,m_j}\}$ in Eq. 8 (multi-view training) or Eq. 9 (double-view training) by fixing $\beta$ and $w_h$;
    Solve $w_h$ in Eq. 8 or Eq. 9 by fixing $\beta$ and $\{w_{m_i,m_j}\}$;
    Solve $\beta$ in Eq. 8 or Eq. 9 by fixing $\{w_{m_i,m_j}\}$ and $w_h$.
until convergence;
return $\{w_{m_i,m_j}\}_{m_i \neq m_j \in \{1, \cdots, M\}}$, $w_h$, $\beta$
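To make the pairwise decomposition concrete, the sketch below evaluates a score of the form "sum over ordered view pairs of beta times a bilinear term" and compares parameter counts for the full and pairwise models. The data layout (`P[m][z][h]` standing in for the per-view word responses), the dictionary keys, and the example sizes are assumptions chosen for illustration.

```python
from itertools import permutations
from math import prod

def pair_matrix(p_i, p_j):
    """phi for one view pair: rows index (z_i, z_j), columns index h.
    p_i[z][h] stands in for the response of word z at location h in view i."""
    n_h = len(p_i[0])
    return [[p_i[a][h] * p_j[b][h] for h in range(n_h)]
            for a in range(len(p_i)) for b in range(len(p_j))]

def pairwise_score(P, W, w_h, beta):
    """Sum over ordered view pairs (i != j) of beta * w^T * phi * w_h."""
    total = 0.0
    for i, j in permutations(range(len(P)), 2):
        phi = pair_matrix(P[i], P[j])
        w = W[(i, j)]
        total += beta[(i, j)] * sum(
            w[r] * phi[r][h] * w_h[h]
            for r in range(len(w)) for h in range(len(w_h)))
    return total

def parameter_counts(n_words, n_h):
    """Sizes of the word weights plus w_h for the full and the pairwise model."""
    full = prod(n_words) + n_h
    pairwise = sum(a * b for a, b in permutations(n_words, 2)) + n_h
    return full, pairwise
```

For four views with 300 visual words each and a hypothetical 1000 pixel locations, `parameter_counts([300] * 4, 1000)` gives about 8.1 billion weights for the full tensor model against roughly 1.08 million for the pairwise one, which illustrates why the decomposition matters as the number of views grows.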

3. Experiments 

We evaluate our method on person Re-ID and kinship verification tasks against state-of-the-art methods on benchmark datasets. Standard training/testing protocols are used in all experiments. For each compared method, we either cite the original results from the papers (denoted by (·)* in the tables) or compute them from released code. Our results are reported as the average over 3 trials.

For each experiment, we choose the same or similar low-level feature as the other methods (see the details in each subsection) for fair comparison. We densely sample the images to generate a low-level local feature per pixel. Then we use K-Means to build the visual vocabularies with about 2 x 10^ randomly selected features per view. Further, every local feature is quantized into one of these visual words based on Euclidean distance. Note that more complicated feature selection methods may be employed to yield better performance, but we do not fine-tune this component for the sake of computational efficiency and generalization ability.
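The quantization step can be sketched as a nearest-centroid assignment. The vocabulary is assumed to be already available (for instance from K-Means), and the function name `quantize` is hypothetical.

```python
def quantize(features, vocabulary):
    """Map each local feature to the index of its nearest visual word
    under squared Euclidean distance (same ordering as Euclidean distance)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(vocabulary)), key=lambda k: sqdist(f, vocabulary[k]))
            for f in features]
```

Each pixel's feature is replaced by a single integer, so an image becomes a map from pixel locations to visual word indices.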

We employ the chessboard distance for Eq. 3 and LIBLINEAR [?] as our SVM solver with hinge loss. We randomly generate about 3 x 10^ training samples to learn the model parameters $w$. The regularization parameters are determined by cross-validation.

3.1. Person Re-identification 

As the performance measure we adopt the standard Cumulative Match Characteristic (CMC) curve, which displays the recognition rate as a function of rank. The recognition rate at rank-r is the proportion of queries correctly matched to a corresponding gallery entity at rank-r or better.
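A CMC curve following this definition can be computed as sketched below. The sketch assumes a single-shot setting in which every probe has exactly one matching gallery entry, and the function name `cmc` and its argument layout are hypothetical.

```python
def cmc(scores, gallery_ids, probe_ids, max_rank):
    """scores[q][g]: similarity of probe q to gallery item g (higher is better).
    Returns the recognition rates at ranks 1..max_rank."""
    hits = [0] * max_rank
    for q, row in enumerate(scores):
        order = sorted(range(len(row)), key=lambda g: -row[g])
        # 0-based position of the correct gallery entry in the ranked list.
        rank = next(r for r, g in enumerate(order) if gallery_ids[g] == probe_ids[q])
        for r in range(rank, max_rank):
            hits[r] += 1
    return [h / len(scores) for h in hits]
```

The curve is non-decreasing by construction, since a query matched at rank r also counts as matched at every larger rank.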

For tasks with multiple camera views, we follow [5] to compare results under two camera views. Consider the results from multiple views as a high dimensional tensor, one dimension per view. To predict pairwise matches from multi-view results (e.g. identifying matches between camera view 1 and view 2 from the predicted results for the joint of views 1, 2, and 3), we can either sum over or find the maximum over the extra dimensions. Cross-validation is used to choose the better way for each dataset.
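For three views, the reduction from joint scores to pairwise scores can be sketched as follows. The nested-list layout `scores[i][j][k]` and the function name are assumptions for illustration.

```python
def reduce_third_view(scores, mode="sum"):
    """scores[i][j][k]: joint score for items i, j, k from views 1, 2, 3.
    Returns a view-1 x view-2 matrix by summing or maximizing over view 3."""
    agg = sum if mode == "sum" else max
    return [[agg(cell) for cell in row] for row in scores]
```

Summing rewards pairs that are supported by many items in the third view, while taking the maximum only asks for one strongly supporting item, which is why the better choice can differ between datasets.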


Table 1. Matching rate comparison (%) on VIPeR and CUHK01.

Rank r =                    1       5       10      15      20      25
VIPeR
SCNCD [31]                  20.7    47.2    60.6    68.8    75.1    79.1
SCNCD_final [31]            37.8    68.5    81.2    87.0    90.4    92.7
LADF [?]                    29.3    61.0    76.0    83.4    88.1    90.9
Mid-level filters [36]      29.1    52.3    65.9    73.9    79.9    84.3
Mid-level+LADF [36]         43.4    73.0    84.9    90.9    93.7    95.5
VW-CooC [32]                30.70   62.98   75.95   81.01   -       -
Ours                        33.5    59.5    72.8    81.3    88.0    89.6
CUHK01
Single-shot LAFT* [18]      25.8    55.0    66.7    73.8    79.0    83.0
Multi-shot LAFT* [18]       31.4    58.0    68.3    74.0    79.0    83.0
Mid-level filters           34.3    55.1    65.0    71.0    74.9    78.0
VW-CooC [32]                44.03   70.47   79.12   84.77   -       -
Ours                        60.39   82.92   90.43   93.42   94.55   95.78


3.1.1 Two Camera Views 

Person Re-ID between two views is the simplest scenario. We test our method on the VIPeR [13] and CUHK Campus [35] datasets. We extract a 672-dim Color+SIFT^1 vector from each 5x5 pixel patch in images as low-level features. We follow the experimental setting in [?] for both datasets.

Our comparison results are listed in Table 1. As we see, on VIPeR "Mid-level+LADF" from [36] is the current best method, which utilizes more discriminative mid-level filters as features and a powerful classifier, and "SCNCD_final" from [31] is the second best, which utilizes only foreground features. Our results are comparable to both of them. Moreover, our method consistently outperforms the base versions of these methods (Mid-level filters and SCNCD), which do not involve the powerful classifier or the foreground information. On CUHK01, our method performs the best. At rank-1, it outperforms [32] and [36] by 16.36 and 26.09 percentage points, respectively. Compared with [32], the improvement mainly comes from the multiple instance setting of our method.

The CMC curve comparison on VIPeR and CUHK01 is shown in Fig. 4. As we see, our curve is very similar to that of LADF. This is mainly because LADF is a second-order (i.e. quadratic) decision function based on metric learning, which shares some commonality with our classifiers.

We also demonstrate the impacts of different numbers of pixel locations (i.e. view-shared space) and visual words (i.e. view-specific space) on the performance using VIPeR in Fig. 5. We sample the pixel locations with a step size from 1 to 5 pixels along the x- and y-axes in images (a larger step leading to fewer samples), while using different numbers of visual words. Visual words capture the variations in appearance, and with more visual words more similar patterns can be differentiated (e.g. pink and red). Matching between pixel locations gives us the statistical information of visual words, and more samples make the statistics more robust. Together they lead to good performance.


^1 We downloaded the code from https://github.com/Robert0812/salience_match

Figure 4. CMC curve comparison on (a) VIPeR and (b) CUHK01, respectively. Note that except for our results, the rest are copied from [?].

Figure 5. Demonstration of the impacts of different numbers of pixel locations and visual words on the performance using VIPeR. Warmer colors denote higher accuracy. This figure is best viewed in color.

Figure 6. CMC curve comparison on WARD for camera pairs 1-2, 1-3, and 2-3. Note that except for our results, the other results are cited from [5].

3.1.2 Three Camera Views 

Now we consider three camera views, and test our method on the WARD dataset [24]. Following [5], we denote the camera views as view 1, 2 and 3. However, for pairwise view matching, [5] did not mention which view is used as probe or gallery. Here, we define the view with the smaller/larger number of data to be the gallery/probe set. We randomly select 35 people for training, and the rest for testing.

We first resize each image to the same 128 x 64 pixels, and take every 2x2 pixel patch in the HSV color space to generate our low-level features by concatenating 3x2x2 = 12 entries into a vector. We choose this feature because in [5] the features were built in the HSV color space as well. Different from [5], we take the whole image to generate features without foreground segmentation.
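The 12-dimensional patch feature can be sketched with the standard library as follows. The input layout (nested lists of RGB triples in [0, 1]) and the `stride` parameter are assumptions: `stride=1` gives dense, overlapping patches, while the default uses non-overlapping patches. Image resizing is left out.

```python
import colorsys

def hsv_patch_features(rgb, patch=2, stride=None):
    """rgb: H x W nested lists of (r, g, b) floats in [0, 1].
    Returns one vector of 3 * patch * patch HSV entries per patch."""
    stride = patch if stride is None else stride
    hsv = [[colorsys.rgb_to_hsv(*px) for px in row] for row in rgb]
    feats = []
    for top in range(0, len(hsv) - patch + 1, stride):
        for left in range(0, len(hsv[0]) - patch + 1, stride):
            vec = []
            for dy in range(patch):
                for dx in range(patch):
                    vec.extend(hsv[top + dy][left + dx])
            feats.append(vec)
    return feats
```

Each returned vector concatenates the H, S and V values of the four pixels in a 2x2 patch, matching the 3x2x2 = 12 entries described above.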


Table 2. AUC comparison (%) on WARD based on Fig. 6.

View pair            1-2     1-3     2-3     Ave.
FT                   93.3    91.0    94.9    93.1
NCR on ICT           90.4    84.8    91.1    88.7
NCR on FT            95.4    91.9    95.6    94.3
Ours: Multi-view     94.4    92.1    98.1    94.9
Ours: Double-view    92.7    91.0    97.5    93.8

The results are shown in Fig. 6. As we see, our method performs similarly to or better than NCR [5], and the curves of both the multi-view training and double-view training for our method behave very similarly. We list the area under curve (AUC) scores in Table 2. With multi-view training, our method is better than NCR on FT by 0.6 percentage points on average (94.9% vs. 94.3%).

3.1.3 Four Camera Views 

Next we consider four camera views, and test our method on the Re-identification Across indoor-outdoor Dataset (RAiD) [5] with two indoor views, camera 1 and 2, and two outdoor views, camera 3 and 4. We again take the views with smaller/larger numbers of data as galleries/probes. We follow [5], and utilize the same HSV low-level feature as in Section 3.1.2.

Our comparison results are shown in Fig. 7. As we see, our method again performs equally well or better than NCR. We list the AUC score comparison results in Table 3. With double-view training, our method is better than NCR on FT by 1.6 percentage points on average (96.3% vs. 94.7%).









































Figure 7. CMC curve comparison on RAiD for camera pairs 1-2 (indoor-indoor), 1-3, 1-4, 2-3, and 2-4 (indoor-outdoor), and 3-4 (outdoor-outdoor). Note that except for our results, the rest of the results are cited from [5].


Table 3. AUC comparison (%) on RAiD based on Fig. 7.

View pair            1-2     1-3     1-4     2-3     2-4     3-4     Ave.
FT                   96.6    84.3    88.8    90.0    93.9    93.5    91.2
NCR on ICT           98.5    90.6    92.1    91.0    94.4    94.1    93.4
NCR on FT            98.1    90.4    93.1    94.5    96.5    95.9    94.7
Ours: Multi-view     98.2    93.0    97.1    94.1    96.6    90.5    94.9
Ours: Double-view    99.3    90.8    98.3    93.0    98.0    98.8    96.3

For both indoor-indoor and outdoor-outdoor cases, our 
method consistently works best, which may indicate that the 
visual word co-occurrence patterns are more discriminative 
if the lighting condition is similar. 

3.2. Kinship Verification & Identification 

As before, we utilize the HSV 12-dim low-level features. In the experiments, we denote father, mother, son, and daughter as F, M, S, and D, respectively. Following [23], we measure the verification performance with the verification rate, defined as the number of correctly classified face pairs divided by the total number of face pairs in the test set. For identification, CMC curves are also used. We only use double-view training in this task since the information captured by parent-offspring pairs is more important.

Kinship verification between two views (one parent and one offspring) is the conventional setting, where we test our method on two datasets, i.e. KinFaceW-I [23] and KinFaceW-II [23]. The former consists of 156 FS, 134 FD, 116 MS and 127 MD pairs, while the latter contains 250 pairs of each kin relation. The main difference between the two datasets is that each pair of face images in KinFaceW-II comes from the same photo while the image pairs in KinFaceW-I come from different photos. We follow the same protocol as that in [23, 6, 28] and use a 5-fold cross validation with balanced positive and negative pairs on the default training/testing split. Results are listed in Table 4.

On KinFaceW-II, our method significantly outperforms the competitors, but on KinFaceW-I ours is worse. Our reasoning is that our current visual word representation using simple K-Means does not account for significant visual ambiguity in appearance when imaging factors (e.g. lighting conditions, illumination, etc.) change substantially. This leads to large intra-cluster variations in visual words that our method does not currently handle well. To further investigate the different performances on the two datasets, we use a smaller training set randomly sampled from KinFaceW-II such that it has the same size as KinFaceW-I, while keeping the same test set, and record the results as "reduced training set". The results become slightly worse than those with the original training set, while still outperforming the other methods. These relatively good results, along with the worse results on KinFaceW-I, suggest that the size of training data is indeed important, but less important than the data sources.

Next we use the TSKinFace dataset [28] for three-view kinship verification (i.e. father, mother, offspring), which contains 513 FM-S and 502 FM-D groups. Following [28], we carry out a 5-fold cross validation with balanced positive and negative samples, and list the results in Table 5. As we see, our method performs consistently better than [28].

Finally we employ the Family 101 dataset [?] to investigate kinship identification, namely, identifying the correct parent/child among a set of candidates given one child/parent image. This dataset contains 14816 images that form 206 nuclear families belonging to 101 unique family trees. Following [6], we adopt 101 nuclear families and use 50 families for training and 51 families for testing. For each of the four kin relations, we train a model and use the model to match offspring images to all possible parent images. The CMC curves are shown in Fig. 8, and Table 6 lists the Area Under Curve (AUC) measure of the CMC curves.

Table 4. Verification rate comparison (%) on KinFaceW.

                               FS      FD      MS      MD      Mean
KinFaceW-I
Dehghan et al. [6]             76.4    72.5    71.9    77.3    74.5
Lu et al. [23]                 72.5    66.5    66.2    72.0    69.9
Qin et al. [28]                76.8    76.8    74.6    78.0    76.6
Ours                           63.5    65.0    63.8    75.6    67.0
KinFaceW-II
Dehghan et al. [6]             83.9    76.7    83.4    84.8    82.2
Lu et al. [23]                 76.9    74.3    77.4    77.6    76.5
Qin et al. [28]                84.6    77.0    84.4    85.4    82.9
Ours                           85.4    81.8    86.6    90.0    86.0
Ours (reduced training set)    84.4    78.2    84.6    87.8    83.8

Table 5. Verification rate comparison (%) on TSKinFace.

                       FS      FD      MS      MD      FM-S    FM-D
Dehghan et al. [6]     79.9    74.2    78.5    76.3    81.9    79.6
Fang et al. [?]        69.1    66.8    68.7    67.9    71.6    69.8
Lu et al. [23]         74.8    70.0    72.2    71.3    77.0    71.4
Qin et al. [28]        83.0    80.5    82.8    81.1    86.4    84.4
Ours                   88.5    87.0    87.9    87.8    90.6    89.0

Table 6. AUC comparison (%) on Family 101.

                       FS      FD      MS      MD      Mean
Dehghan et al. [6]     88.8    91.3    94.3    96.4    92.7
Ours                   90.3    94.6    96.0    97.0    94.5

Figure 8. CMC curve comparison with [6] on Family 101 (x-axis: rank).

3.3. Storage & Computational Time 

Storage (St for short) and computational time during testing are two critical issues in real-world applications. In our method, we only need to store a feature matrix for each entity based on Eq. 3, which is used to calculate similarities between different entities. The computational time can be roughly divided into two parts: (1) computing feature matrices, $T_1$, and (2) predicting group membership, $T_2$. We do not consider the time for generating low-level features, since different implementations vary significantly.


Table 7. Average storage and computational time for our method.

          St (Kb)    T1 (ms)    T2 (ms)
VIPeR     110.7      52.9       0.6
WARD      113.7      99.7       1.5
RAiD      166.5      68.7       0.5

We record the storage and computational time using 300 visual words for both probe and gallery sets on VIPeR (two views), WARD (three views), and RAiD (four views). The rest of the parameters are the same as described in Section 3.1. As we see, the storage per data sample and computational time are linearly proportional to the size of images and number of visual words. Our implementation is based on unoptimized MATLAB code. Numbers are listed in Table 7, including the time for saving and loading features. Our experiments were all run on a multi-thread CPU (Xeon E5-2696 v2) with a GPU (GTX TITAN). The method runs efficiently with very low demand for storage.


4. Conclusion 

In this paper, we propose a general parametric probability model for the group membership prediction (GMP) problem. We introduce the notions of view-specific and view-shared latent variables to capture visual information and commonality for each view. Using these two variables, we can factorize the group membership score into a tensor product, and thus propose a new visual word co-occurrence tensor feature to represent groups of data samples. Our parametric probability model can handle the multiple instance cases as well. Further, we propose discriminatively learning a bilinear classifier for GMP, with the decision function as the marginalization over all latent variables. Our experiments on multi-camera person Re-ID and kinship verification tasks demonstrate the good predictive ability and computational efficiency of our method. As future work, we would like to explore other applications for our method such as activity retrieval [?], and develop new approaches such as zero-shot recognition [?] and structured learning [?] for our problem.